Analyze the data and generate insights that could help Netflix in deciding which type of shows/movies to produce and how they can grow the business in different countries
This exploratory data analysis practises the Python skills learned so far on a structured dataset: loading, inspecting, wrangling, exploring, and drawing conclusions from data. Each step is accompanied by observations that explain how to approach the dataset. Some questions are answered along the way for reference, though not all of them are explored in the analysis.
a. How was it collected?
%%time
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import pandas_profiling as prf
%matplotlib inline
Wall time: 5.99 ms
netflix = pd.read_csv(r"C:\Users\modem\Downloads\netflix_data.csv")
netflix
| | Unnamed: 0 | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | 25-Sep-21 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | 24-Sep-21 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | 24-Sep-21 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | 24-Sep-21 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | 24-Sep-21 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8802 | 8802 | s8803 | Movie | Zodiac | David Fincher | Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... | United States | 20-Nov-19 | 2007 | R | 158 min | Cult Movies, Dramas, Thrillers | A political cartoonist, a crime reporter and a... |
| 8803 | 8803 | s8804 | TV Show | Zombie Dumb | NaN | NaN | NaN | 1-Jul-19 | 2018 | TV-Y7 | 2 Seasons | Kids' TV, Korean TV Shows, TV Comedies | While living alone in a spooky town, a young g... |
| 8804 | 8804 | s8805 | Movie | Zombieland | Ruben Fleischer | Jesse Eisenberg, Woody Harrelson, Emma Stone, ... | United States | 1-Nov-19 | 2009 | R | 88 min | Comedies, Horror Movies | Looking to survive in a world taken over by zo... |
| 8805 | 8805 | s8806 | Movie | Zoom | Peter Hewitt | Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... | United States | 11-Jan-20 | 2006 | PG | 88 min | Children & Family Movies, Comedies | Dragged from civilian life, a former superhero... |
| 8806 | 8806 | s8807 | Movie | Zubaan | Mozez Singh | Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... | India | 2-Mar-19 | 2015 | TV-14 | 111 min | Dramas, International Movies, Music & Musicals | A scrappy but poor boy worms his way into a ty... |
8807 rows × 13 columns
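The `Unnamed: 0` column looks like a row index that was written out when the CSV was saved. A minimal sketch, assuming that is the case, of reading it back as the index with `index_col=0`. The two-row sample below is a made-up stand-in, not the real file.

```python
import io
import pandas as pd

# Tiny stand-in for netflix_data.csv (assumption: the first, unnamed column is a saved index).
sample_csv = io.StringIO(
    ",show_id,type,title\n"
    "0,s1,Movie,Dick Johnson Is Dead\n"
    "1,s2,TV Show,Blood & Water\n"
)

# index_col=0 uses the first column as the index instead of loading it as 'Unnamed: 0'.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

Dropping the column after loading (`netflix.drop(columns='Unnamed: 0')`) would be an equivalent option.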
netflix_copy = netflix.copy()
netflix_copy.head()
| | Unnamed: 0 | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | 25-Sep-21 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | 24-Sep-21 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | 24-Sep-21 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | 24-Sep-21 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | 24-Sep-21 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
3.1 Understanding the dataset
netflix.shape
(8807, 13)
netflix.columns
Index(['Unnamed: 0', 'show_id', 'type', 'title', 'director', 'cast', 'country',
'date_added', 'release_year', 'rating', 'duration', 'listed_in',
'description'],
dtype='object')
netflix.describe()
| | Unnamed: 0 | release_year |
|---|---|---|
| count | 8807.000000 | 8807.000000 |
| mean | 4403.000000 | 2014.180198 |
| std | 2542.506244 | 8.819312 |
| min | 0.000000 | 1925.000000 |
| 25% | 2201.500000 | 2013.000000 |
| 50% | 4403.000000 | 2017.000000 |
| 75% | 6604.500000 | 2019.000000 |
| max | 8806.000000 | 2021.000000 |
netflix.describe(include='all')
| | Unnamed: 0 | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8807.000000 | 8807 | 8807 | 8807 | 6173 | 7982 | 7976 | 8797 | 8807.000000 | 8803 | 8804 | 8807 | 8807 |
| unique | NaN | 8807 | 2 | 8804 | 4528 | 7692 | 746 | 1767 | NaN | 17 | 220 | 514 | 8775 |
| top | NaN | s1 | Movie | 15-Aug | Rajiv Chilaka | David Attenborough | United States | 1-Jan-20 | NaN | TV-MA | 1 Season | Dramas, International Movies | Paranormal activity at a lush, abandoned prope... |
| freq | NaN | 1 | 6131 | 2 | 19 | 19 | 2818 | 109 | NaN | 3207 | 1793 | 362 | 4 |
| mean | 4403.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2014.180198 | NaN | NaN | NaN | NaN |
| std | 2542.506244 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.819312 | NaN | NaN | NaN | NaN |
| min | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1925.000000 | NaN | NaN | NaN | NaN |
| 25% | 2201.500000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013.000000 | NaN | NaN | NaN | NaN |
| 50% | 4403.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2017.000000 | NaN | NaN | NaN | NaN |
| 75% | 6604.500000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019.000000 | NaN | NaN | NaN | NaN |
| max | 8806.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2021.000000 | NaN | NaN | NaN | NaN |
`describe(include='all')` gives the non-null count, the number of unique values, and the most frequent value for each column.
netflix.sort_values(by=['release_year'],ascending=False).head(5)
| | Unnamed: 0 | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 693 | 693 | s694 | Movie | Ali & Ratu Ratu Queens | Lucky Kuswandi | Iqbaal Ramadhan, Nirina Zubir, Asri Welas, Tik... | NaN | 17-Jun-21 | 2021 | TV-14 | 101 min | Comedies, Dramas, International Movies | After his father's passing, a teenager sets ou... |
| 781 | 781 | s782 | Movie | Black Holes \| The Edge of All We Know | Peter Galison | NaN | NaN | 2-Jun-21 | 2021 | TV-14 | 99 min | Documentaries | Follow scientists on their quest to understand... |
| 762 | 762 | s763 | Movie | Sweet & Sour | Lee Kae-byeok | Jang Ki-yong, Chae Soo-bin, Jung Soo-jung | South Korea | 4-Jun-21 | 2021 | TV-14 | 103 min | Comedies, International Movies, Romantic Movies | Faced with real-world opportunities and challe... |
| 763 | 763 | s764 | TV Show | Sweet Tooth | NaN | Nonso Anozie, Christian Convery, Adeel Akhtar,... | United States | 4-Jun-21 | 2021 | TV-14 | 1 Season | TV Action & Adventure, TV Dramas, TV Sci-Fi & ... | On a perilous adventure across a post-apocalyp... |
| 764 | 764 | s765 | Movie | Trippin' with the Kandasamys | Jayan Moodley | Jailoshini Naidoo, Maeshni Naicker, Madhushan ... | South Africa | 4-Jun-21 | 2021 | TV-14 | 94 min | Comedies, International Movies, Romantic Movies | To rekindle their marriages, best friends-turn... |
netflix.nunique()
Unnamed: 0      8807
show_id         8807
type               2
title           8804
director        4528
cast            7692
country          746
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64
The output above gives the number of unique values in each column.
netflix.corr()
| | Unnamed: 0 | release_year |
|---|---|---|
| Unnamed: 0 | 1.000000 | -0.246713 |
| release_year | -0.246713 | 1.000000 |
netflix.tail()
| | Unnamed: 0 | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8802 | 8802 | s8803 | Movie | Zodiac | David Fincher | Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... | United States | 20-Nov-19 | 2007 | R | 158 min | Cult Movies, Dramas, Thrillers | A political cartoonist, a crime reporter and a... |
| 8803 | 8803 | s8804 | TV Show | Zombie Dumb | NaN | NaN | NaN | 1-Jul-19 | 2018 | TV-Y7 | 2 Seasons | Kids' TV, Korean TV Shows, TV Comedies | While living alone in a spooky town, a young g... |
| 8804 | 8804 | s8805 | Movie | Zombieland | Ruben Fleischer | Jesse Eisenberg, Woody Harrelson, Emma Stone, ... | United States | 1-Nov-19 | 2009 | R | 88 min | Comedies, Horror Movies | Looking to survive in a world taken over by zo... |
| 8805 | 8805 | s8806 | Movie | Zoom | Peter Hewitt | Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... | United States | 11-Jan-20 | 2006 | PG | 88 min | Children & Family Movies, Comedies | Dragged from civilian life, a former superhero... |
| 8806 | 8806 | s8807 | Movie | Zubaan | Mozez Singh | Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... | India | 2-Mar-19 | 2015 | TV-14 | 111 min | Dramas, International Movies, Music & Musicals | A scrappy but poor boy worms his way into a ty... |
netflix.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Unnamed: 0    8807 non-null   int64
 1   show_id       8807 non-null   object
 2   type          8807 non-null   object
 3   title         8807 non-null   object
 4   director      6173 non-null   object
 5   cast          7982 non-null   object
 6   country       7976 non-null   object
 7   date_added    8797 non-null   object
 8   release_year  8807 non-null   int64
 9   rating        8803 non-null   object
 10  duration      8804 non-null   object
 11  listed_in     8807 non-null   object
 12  description   8807 non-null   object
dtypes: int64(2), object(11)
memory usage: 894.6+ KB
(netflix.isnull().sum()/len(netflix))*100
Unnamed: 0       0.000000
show_id          0.000000
type             0.000000
title            0.000000
director        29.908028
cast             9.367549
country          9.435676
date_added       0.113546
release_year     0.000000
rating           0.045418
duration         0.034064
listed_in        0.000000
description      0.000000
dtype: float64
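About 30% of the values are missing in the director column, and about 9% each in the cast and country columns. date_added, rating, and duration are each missing in well under 1% of rows.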
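The same percentages can be computed with `isna().mean()`, since the mean of a boolean mask is the fraction of `True` values. A small sketch on made-up data (the frame `toy` is illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "country": ["India", "India", np.nan, "Ghana"],
    "title": ["t1", "t2", "t3", "t4"],
})

# Fraction of missing values per column, as a percentage, largest first.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```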
3.2 Pre Profiling
profile_before = prf.ProfileReport(netflix)
profile_before
profile_before.to_file(output_file="Netflix_Before_PreProcessing.html")
3.3 Preprocessing
netflix.head()
| | Unnamed: 0 | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | 25-Sep-21 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | 24-Sep-21 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | 24-Sep-21 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | 24-Sep-21 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | 24-Sep-21 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
Some columns hold several comma-separated values in a single cell, so they are unnested (one value per row) before preparing the final dataset.
## Unnesting director column
dir_constraint=netflix['director'].apply(lambda x: str(x).split(', ')).tolist()
df1 = pd.DataFrame(dir_constraint, index = netflix['title'])
df1 = df1.stack()
df1 = pd.DataFrame(df1.reset_index())
df1.rename(columns={0:'Directors'},inplace=True)
df1 = df1.drop(['level_1'],axis=1)
df1.head(10)
| | title | Directors |
|---|---|---|
| 0 | Dick Johnson Is Dead | Kirsten Johnson |
| 1 | Blood & Water | nan |
| 2 | Ganglands | Julien Leclercq |
| 3 | Jailbirds New Orleans | nan |
| 4 | Kota Factory | nan |
| 5 | Midnight Mass | Mike Flanagan |
| 6 | My Little Pony: A New Generation | Robert Cullen |
| 7 | My Little Pony: A New Generation | José Luis Ucha |
| 8 | Sankofa | Haile Gerima |
| 9 | The Great British Baking Show | Andy Devonshire |
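The same unnesting can be sketched with `Series.str.split` and `DataFrame.explode`, which avoids the intermediate `stack` and `reset_index` steps. Unlike the `str(x)` approach, this keeps a missing director as a real `NaN` rather than the string `'nan'`. The frame `toy` is made-up illustration data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["My Little Pony: A New Generation", "Blood & Water"],
    "director": ["Robert Cullen, José Luis Ucha", np.nan],
})

# Split the comma-separated string into a list, then emit one row per list element.
directors = (
    toy.assign(Directors=toy["director"].str.split(", "))
       .explode("Directors")[["title", "Directors"]]
       .reset_index(drop=True)
)
print(directors)
```

The same pattern would apply to the cast, listed_in, and country columns.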
## Unnesting - cast column
cast_constraint=netflix['cast'].apply(lambda x: str(x).split(', ')).tolist()
df2 = pd.DataFrame(cast_constraint, index = netflix['title'])
df2 = df2.stack()
df2 = pd.DataFrame(df2.reset_index())
df2.rename(columns={0:'Actors'},inplace=True)
df2 = df2.drop(['level_1'],axis=1)
df2.head(10)
| | title | Actors |
|---|---|---|
| 0 | Dick Johnson Is Dead | nan |
| 1 | Blood & Water | Ama Qamata |
| 2 | Blood & Water | Khosi Ngema |
| 3 | Blood & Water | Gail Mabalane |
| 4 | Blood & Water | Thabang Molaba |
| 5 | Blood & Water | Dillon Windvogel |
| 6 | Blood & Water | Natasha Thahane |
| 7 | Blood & Water | Arno Greeff |
| 8 | Blood & Water | Xolile Tshabalala |
| 9 | Blood & Water | Getmore Sithole |
netflix.head()
| | Unnamed: 0 | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | 25-Sep-21 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | 24-Sep-21 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | 24-Sep-21 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | 24-Sep-21 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | 24-Sep-21 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
## Unnesting - listed_in column
listed_constraint=netflix['listed_in'].apply(lambda x: str(x).split(', ')).tolist()
df3 = pd.DataFrame(listed_constraint, index = netflix['title'])
df3 = df3.stack()
df3 = pd.DataFrame(df3.reset_index())
df3.rename(columns={0:'Genre'},inplace=True)
df3 = df3.drop(['level_1'],axis=1)
df3.head(10)
| | title | Genre |
|---|---|---|
| 0 | Dick Johnson Is Dead | Documentaries |
| 1 | Blood & Water | International TV Shows |
| 2 | Blood & Water | TV Dramas |
| 3 | Blood & Water | TV Mysteries |
| 4 | Ganglands | Crime TV Shows |
| 5 | Ganglands | International TV Shows |
| 6 | Ganglands | TV Action & Adventure |
| 7 | Jailbirds New Orleans | Docuseries |
| 8 | Jailbirds New Orleans | Reality TV |
| 9 | Kota Factory | International TV Shows |
## Unnesting - country column
country_constraint=netflix['country'].apply(lambda x: str(x).split(', ')).tolist()
df4 = pd.DataFrame(country_constraint, index = netflix['title'])
df4 = df4.stack()
df4 = pd.DataFrame(df4.reset_index())
df4.rename(columns={0:'Country'},inplace=True)
df4 = df4.drop(['level_1'],axis=1)
df4.head(10)
| | title | Country |
|---|---|---|
| 0 | Dick Johnson Is Dead | United States |
| 1 | Blood & Water | South Africa |
| 2 | Ganglands | nan |
| 3 | Jailbirds New Orleans | nan |
| 4 | Kota Factory | India |
| 5 | Midnight Mass | nan |
| 6 | My Little Pony: A New Generation | nan |
| 7 | Sankofa | United States |
| 8 | Sankofa | Ghana |
| 9 | Sankofa | Burkina Faso |
Combine all the unnested dataframes
df5 = df2.merge(df1,on=['title'],how='inner')
df6 = df5.merge(df3,on=['title'],how='inner')
df7 = df6.merge(df4,on=['title'],how='inner')
df7.head()
| | title | Actors | Directors | Genre | Country |
|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | nan | Kirsten Johnson | Documentaries | United States |
| 1 | Blood & Water | Ama Qamata | nan | International TV Shows | South Africa |
| 2 | Blood & Water | Ama Qamata | nan | TV Dramas | South Africa |
| 3 | Blood & Water | Ama Qamata | nan | TV Mysteries | South Africa |
| 4 | Blood & Water | Khosi Ngema | nan | International TV Shows | South Africa |
df7
| | title | Actors | Directors | Genre | Country |
|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | nan | Kirsten Johnson | Documentaries | United States |
| 1 | Blood & Water | Ama Qamata | nan | International TV Shows | South Africa |
| 2 | Blood & Water | Ama Qamata | nan | TV Dramas | South Africa |
| 3 | Blood & Water | Ama Qamata | nan | TV Mysteries | South Africa |
| 4 | Blood & Water | Khosi Ngema | nan | International TV Shows | South Africa |
| ... | ... | ... | ... | ... | ... |
| 203158 | Zubaan | Anita Shabdish | Mozez Singh | International Movies | India |
| 203159 | Zubaan | Anita Shabdish | Mozez Singh | Music & Musicals | India |
| 203160 | Zubaan | Chittaranjan Tripathy | Mozez Singh | Dramas | India |
| 203161 | Zubaan | Chittaranjan Tripathy | Mozez Singh | International Movies | India |
| 203162 | Zubaan | Chittaranjan Tripathy | Mozez Singh | Music & Musicals | India |
203163 rows × 5 columns
df7.shape
(203163, 5)
Merging the unnested data with the original dataframe
netflix = df7.merge(netflix[['show_id', 'type', 'title', 'date_added',
'release_year', 'rating', 'duration']],on=['title'],how='left')
netflix.head()
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | nan | Kirsten Johnson | Documentaries | United States | s1 | Movie | 25-Sep-21 | 2020 | PG-13 | 90 min |
| 1 | Blood & Water | Ama Qamata | nan | International TV Shows | South Africa | s2 | TV Show | 24-Sep-21 | 2021 | TV-MA | 2 Seasons |
| 2 | Blood & Water | Ama Qamata | nan | TV Dramas | South Africa | s2 | TV Show | 24-Sep-21 | 2021 | TV-MA | 2 Seasons |
| 3 | Blood & Water | Ama Qamata | nan | TV Mysteries | South Africa | s2 | TV Show | 24-Sep-21 | 2021 | TV-MA | 2 Seasons |
| 4 | Blood & Water | Khosi Ngema | nan | International TV Shows | South Africa | s2 | TV Show | 24-Sep-21 | 2021 | TV-MA | 2 Seasons |
netflix.shape
(204539, 11)
The final dataset has 204,539 rows (about 2 lakh) and 11 columns.
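Merging on `title` multiplies rows whenever a title appears more than once on the right-hand side, which is one likely reason the row count grows from 203,163 to 204,539 after the merge (the dataset has 8,804 unique titles for 8,807 rows). A minimal sketch on made-up frames; `validate="many_to_one"` is a pandas `merge` option that raises when the right-hand keys are not unique.

```python
import pandas as pd

left = pd.DataFrame({"title": ["A", "A", "B"], "Genre": ["Dramas", "Comedies", "Dramas"]})
# Title "A" appears twice on the right, as two different shows sharing a name would.
right = pd.DataFrame({"title": ["A", "A", "B"], "show_id": ["s1", "s2", "s3"]})

merged = left.merge(right, on="title", how="left")
print(len(left), "->", len(merged))  # each left row for "A" is matched twice

# Asking pandas to check the key relationship turns the silent blow-up into an error.
try:
    left.merge(right, on="title", how="left", validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)
```

Carrying `show_id` through the unnesting and merging on it instead would avoid the duplication, assuming `show_id` is unique as the `nunique()` output suggests.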
netflix.isna().sum()
title             0
Actors            0
Directors         0
Genre             0
Country           0
show_id           0
type              0
date_added      158
release_year      0
rating           67
duration          3
dtype: int64
date_added, rating, and duration have some missing values, which are treated below.
total_null = netflix.isnull().sum().sort_values(ascending = False)
percent = ((netflix.isnull().sum()/netflix.isnull().count())*100).sort_values(ascending = False)
print("Total records = ", netflix.shape[0])
missing_data = pd.concat([total_null,percent.round(2)],axis=1,keys=['Total Missing','In Percent'])
missing_data.head(10)
Total records = 204539
| | Total Missing | In Percent |
|---|---|---|
| date_added | 158 | 0.08 |
| rating | 67 | 0.03 |
| duration | 3 | 0.00 |
| title | 0 | 0.00 |
| Actors | 0 | 0.00 |
| Directors | 0 | 0.00 |
| Genre | 0 | 0.00 |
| Country | 0 | 0.00 |
| show_id | 0 | 0.00 |
| type | 0 | 0.00 |
The table above summarizes missing values as absolute counts and as percentages; date_added has the most missing values.
Treating Missing values
import numpy as np
## Some columns hold the string 'nan' (produced by str(x) during unnesting) instead of a real missing value; replace it
netflix['Actors'] = netflix['Actors'].replace('nan', 'Unknown Actor')
netflix['Directors'] = netflix['Directors'].replace('nan', 'Unknown Director')
netflix['Country'] = netflix['Country'].replace('nan', np.nan)
netflix.head()
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Unknown Actor | Kirsten Johnson | Documentaries | United States | s1 | Movie | 25-Sep-21 | 2020 | PG-13 | 90 min |
| 1 | Blood & Water | Ama Qamata | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 24-Sep-21 | 2021 | TV-MA | 2 Seasons |
| 2 | Blood & Water | Ama Qamata | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 24-Sep-21 | 2021 | TV-MA | 2 Seasons |
| 3 | Blood & Water | Ama Qamata | Unknown Director | TV Mysteries | South Africa | s2 | TV Show | 24-Sep-21 | 2021 | TV-MA | 2 Seasons |
| 4 | Blood & Water | Khosi Ngema | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 24-Sep-21 | 2021 | TV-MA | 2 Seasons |
total_null = netflix.isnull().sum().sort_values(ascending = False)
percent = ((netflix.isnull().sum()/netflix.isnull().count())*100).sort_values(ascending = False)
print("Total records = ", netflix.shape[0])
missing_data = pd.concat([total_null,percent.round(2)],axis=1,keys=['Total Missing','In Percent'])
missing_data.head(10)
Total records = 204539
| | Total Missing | In Percent |
|---|---|---|
| Country | 12497 | 6.11 |
| date_added | 158 | 0.08 |
| rating | 67 | 0.03 |
| duration | 3 | 0.00 |
| title | 0 | 0.00 |
| Actors | 0 | 0.00 |
| Directors | 0 | 0.00 |
| Genre | 0 | 0.00 |
| show_id | 0 | 0.00 |
| type | 0 | 0.00 |
After replacing the string 'nan' with np.nan, the actual share of null values in Country rises to 6.11% (12,497 rows).
netflix[netflix['duration'].isnull()]
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 129171 | Louis C.K. 2017 | Louis C.K. | Louis C.K. | Movies | United States | s5542 | Movie | 4-Apr-17 | 2017 | 74 min | NaN |
| 134237 | Louis C.K.: Hilarious | Louis C.K. | Louis C.K. | Movies | United States | s5795 | Movie | 16-Sep-16 | 2010 | 84 min | NaN |
| 134371 | Louis C.K.: Live at the Comedy Store | Louis C.K. | Louis C.K. | Movies | United States | s5814 | Movie | 15-Aug-16 | 2015 | 66 min | NaN |
For these three rows the duration and rating values were swapped: the runtime sits in the rating column and duration is empty. The rating values are copied into the missing duration cells, and the rating is then set to NR.
## Rows where duration is missing carry the runtime in the rating column
swapped = netflix['duration'].isnull()
netflix.loc[swapped,'duration'] = netflix.loc[swapped,'rating']
netflix.loc[netflix['rating'].str.contains('min', na=False),'rating'] = 'NR'
netflix['rating'] = netflix['rating'].fillna('NR')
netflix.isnull().sum()
title               0
Actors              0
Directors           0
Genre               0
Country         12497
show_id             0
type                0
date_added        158
release_year        0
rating              0
duration            0
dtype: int64
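A compact sketch of the duration/rating repair on made-up rows: wherever `rating` holds a value containing `min` and `duration` is empty, move the value to `duration` and mark the rating as `NR`. The frame `toy` and its titles are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["Special 1", "Film 2"],
    "rating": ["74 min", "PG-13"],
    "duration": [np.nan, "90 min"],
})

# Rows where the runtime text ended up in the rating column.
swapped = toy["rating"].str.contains("min", na=False) & toy["duration"].isna()

toy.loc[swapped, "duration"] = toy.loc[swapped, "rating"]
toy.loc[swapped, "rating"] = "NR"
print(toy)
```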
Filling missing values in date_added with the most frequent (mode) date_added among titles with the same release year
for i in netflix[netflix['date_added'].isnull()]['release_year'].unique():
    date = netflix[netflix['release_year'] == i]['date_added'].mode().values[0]
    netflix.loc[netflix['release_year'] == i,'date_added'] = netflix.loc[netflix['release_year']==i,'date_added'].fillna(date)
netflix.isnull().sum()
title               0
Actors              0
Directors           0
Genre               0
Country         12497
show_id             0
type                0
date_added          0
release_year        0
rating              0
duration            0
dtype: int64
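The per-year mode fill can also be sketched without an explicit loop, using `groupby(...).transform`. This is an alternative formulation on made-up data, not the code that produced the output above; `fill_with_group_mode` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "release_year": [2020, 2020, 2020, 2021, 2021],
    "date_added": ["1-Jan-20", "1-Jan-20", np.nan, "4-Jun-21", np.nan],
})

def fill_with_group_mode(s: pd.Series) -> pd.Series:
    """Fill missing values with the group's most frequent value, if it has one."""
    modes = s.mode()  # mode() ignores NaN by default
    return s.fillna(modes.iloc[0]) if not modes.empty else s

toy["date_added"] = toy.groupby("release_year")["date_added"].transform(fill_with_group_mode)
print(toy)
```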
Filling missing values in the Country column with the most frequent (mode) country of the same director
for i in netflix[netflix['Country'].isnull()]['Directors'].unique():
    if i in netflix[~netflix['Country'].isnull()]['Directors'].unique():
        country = netflix[netflix['Directors'] == i]['Country'].mode().values[0]
        netflix.loc[netflix['Directors'] == i,'Country'] = netflix.loc[netflix['Directors'] == i,'Country'].fillna(country)
netflix.isnull().sum()
title              0
Actors             0
Directors          0
Genre              0
Country         4276
show_id            0
type               0
date_added         0
release_year       0
rating             0
duration           0
dtype: int64
The remaining missing values are filled the same way using the Actors column
for i in netflix[netflix['Country'].isnull()]['Actors'].unique():
    if i in netflix[~netflix['Country'].isnull()]['Actors'].unique():
        imp = netflix[netflix['Actors'] == i]['Country'].mode().values[0]
        netflix.loc[netflix['Actors'] == i,'Country'] = netflix.loc[netflix['Actors']==i,'Country'].fillna(imp)
netflix.isnull().sum()
title              0
Actors             0
Directors          0
Genre              0
Country         2069
show_id            0
type               0
date_added         0
release_year       0
rating             0
duration           0
dtype: int64
netflix['Country'] = netflix['Country'].fillna('Unknown Country')
netflix.isnull().sum()
title           0
Actors          0
Directors       0
Genre           0
Country         0
show_id         0
type            0
date_added      0
release_year    0
rating          0
duration        0
dtype: int64
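The country imputation can be sketched as a lookup: compute the most frequent known country per director once, map it onto the rows, and fall back to a placeholder. This is an alternative formulation on made-up data, with illustrative director names `D1` to `D3`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Directors": ["D1", "D1", "D1", "D2", "D3"],
    "Country": ["India", "India", np.nan, np.nan, "Ghana"],
})

# Most frequent known country per director; directors with no known country are absent.
known = toy.dropna(subset=["Country"])
mode_by_director = known.groupby("Directors")["Country"].agg(lambda s: s.mode().iloc[0])

# Fill from the lookup first, then fall back to a placeholder label.
toy["Country"] = (
    toy["Country"]
    .fillna(toy["Directors"].map(mode_by_director))
    .fillna("Unknown Country")
)
print(toy)
```

One thing to keep in mind with either formulation: a placeholder key such as `Unknown Director` behaves like any other director, so rows that share it all receive the same most frequent country. Excluding placeholder keys before imputing may be worth considering.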
Missing-value handling is now complete; the next step is the data analysis itself.
netflix.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 204539 entries, 0 to 204538
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   title         204539 non-null  object
 1   Actors        204539 non-null  object
 2   Directors     204539 non-null  object
 3   Genre         204539 non-null  object
 4   Country       204539 non-null  object
 5   show_id       204539 non-null  object
 6   type          204539 non-null  object
 7   date_added    204539 non-null  object
 8   release_year  204539 non-null  int64
 9   rating        204539 non-null  object
 10  duration      204539 non-null  object
dtypes: int64(1), object(10)
memory usage: 26.8+ MB
#converting date added data type(object format) into datetime format to extract years, month
netflix["date_added"] = pd.to_datetime(netflix['date_added'])
netflix.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 204539 entries, 0 to 204538
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   title         204539 non-null  object
 1   Actors        204539 non-null  object
 2   Directors     204539 non-null  object
 3   Genre         204539 non-null  object
 4   Country       204539 non-null  object
 5   show_id       204539 non-null  object
 6   type          204539 non-null  object
 7   date_added    204539 non-null  datetime64[ns]
 8   release_year  204539 non-null  int64
 9   rating        204539 non-null  object
 10  duration      204539 non-null  object
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 26.8+ MB
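Dates such as `25-Sep-21` can be parsed with an explicit format so pandas does not have to guess the layout. A minimal sketch on made-up strings; the `not a date` entry is there only to show what `errors="coerce"` does.

```python
import pandas as pd

raw = pd.Series(["25-Sep-21", "1-Jul-19", "not a date"])

# %d = day, %b = abbreviated month name, %y = two-digit year.
# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%d-%b-%y", errors="coerce")
print(parsed)
```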
## Removing the min string in duration column
netflix['duration'] = netflix['duration'].str.replace(" min", "", regex=False)
netflix.head(6)
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Unknown Actor | Kirsten Johnson | Documentaries | United States | s1 | Movie | 2021-09-25 | 2020 | PG-13 | 90 |
| 1 | Blood & Water | Ama Qamata | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons |
| 2 | Blood & Water | Ama Qamata | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons |
| 3 | Blood & Water | Ama Qamata | Unknown Director | TV Mysteries | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons |
| 4 | Blood & Water | Khosi Ngema | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons |
| 5 | Blood & Water | Khosi Ngema | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons |
netflix['duration2'] = netflix.duration.copy()
netflix_ = netflix.copy()
netflix_.loc[netflix_['duration2'].str.contains('Season'),'duration2'] = 0
netflix_['duration2'] = netflix_.duration2.astype('int')
netflix_.head()
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration | duration2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Unknown Actor | Kirsten Johnson | Documentaries | United States | s1 | Movie | 2021-09-25 | 2020 | PG-13 | 90 | 90 |
| 1 | Blood & Water | Ama Qamata | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 0 |
| 2 | Blood & Water | Ama Qamata | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 0 |
| 3 | Blood & Water | Ama Qamata | Unknown Director | TV Mysteries | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 0 |
| 4 | Blood & Water | Khosi Ngema | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 0 |
netflix_.duration2.describe()
count    204539.000000
mean         77.503151
std          52.443402
min           0.000000
25%           0.000000
50%          95.000000
75%         112.000000
max         312.000000
Name: duration2, dtype: float64
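Coding TV shows as 0 minutes pulls the summary statistics down (the 25th percentile above is 0). A sketch of an alternative, on made-up values, that extracts minutes and seasons into separate numeric series and leaves the non-applicable rows as `NaN`, which `mean()` and `describe()` skip:

```python
import pandas as pd

duration = pd.Series(["90 min", "2 Seasons", "88 min", "1 Season"])

# Pull out the number only when it is followed by "min"; other rows become NaN.
minutes = pd.to_numeric(duration.str.extract(r"(\d+)\s*min", expand=False))

# Same idea for TV shows, counting seasons.
seasons = pd.to_numeric(duration.str.extract(r"(\d+)\s*Season", expand=False))

print(minutes.mean())  # averages movie runtimes only; NaN rows are skipped
```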
netflix_.nunique()
title            8804
Actors          36440
Directors        4994
Genre              42
Country           127
show_id          8807
type                2
date_added       1714
release_year       74
rating             14
duration          220
duration2         206
dtype: int64
Actors has the most unique values, followed by show_id and title, and then Directors.
profile_clean = prf.ProfileReport(netflix)
profile_clean
profile_clean.to_file(output_file='Netflix_After_PreProcessing.html')
Correlation between the features
What different types of show or movie are uploaded on Netflix?
##method1
netflix.groupby('type')['title'].count().sort_values(ascending=False)
type
Movie      147799
TV Show     56740
Name: title, dtype: int64
netflix['type'].value_counts().to_frame('values_count')
| values_count | |
|---|---|
| Movie | 147799 |
| TV Show | 56740 |
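Because the table has been unnested, these counts are rows (one per title, actor, director, genre, and country combination), not titles. The original file had 6,131 movies among 8,807 entries. A sketch on made-up rows of counting distinct shows per type instead:

```python
import pandas as pd

# Made-up exploded rows: one movie with three actor rows, one show with one row.
toy = pd.DataFrame({
    "show_id": ["s1", "s1", "s1", "s2"],
    "type": ["Movie", "Movie", "Movie", "TV Show"],
    "Actors": ["A", "B", "C", "D"],
})

rows_per_type = toy["type"].value_counts()                  # counts exploded rows
titles_per_type = toy.groupby("type")["show_id"].nunique()  # counts distinct shows
print(rows_per_type, titles_per_type, sep="\n")
```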
netflix.groupby(["type","release_year"])["title"].agg(pd.Series.mode)
type release_year
Movie 1942 The Battle of Midway
1943 [Undercover: How to Operate Behind Enemy Lines...
1944 Tunisian Victory
1945 Know Your Enemy - Japan
1946 Let There Be Light
...
TV Show 2017 Narcos
2018 9-Feb
2019 Creeped Out
2020 The Eddy
2021 Navarasa
Name: title, Length: 119, dtype: object
Univariate analysis of duration column
## Histogram to see the distribution of duration
plt.style.use('dark_background')
## displot is figure-level and creates its own figure, so size it with height/aspect instead of plt.figure
sns.displot(netflix_['duration2'], height=2, aspect=5)
<seaborn.axisgrid.FacetGrid at 0x2482a8a59d0>
Most values are around 100 minutes; the spike at 0 corresponds to TV shows, whose duration was set to 0.
bins = [-1,1,50,80,100,120,150,200,315]
labels = ['<1','1-50','50-80','80-100','100-120','120-150','150-200','200-315']
netflix_['duration2'] = pd.cut(netflix_['duration2'],bins = bins, labels = labels )
netflix_.head()
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration | duration2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Unknown Actor | Kirsten Johnson | Documentaries | United States | s1 | Movie | 2021-09-25 | 2020 | PG-13 | 90 | 80-100 |
| 1 | Blood & Water | Ama Qamata | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | <1 |
| 2 | Blood & Water | Ama Qamata | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | <1 |
| 3 | Blood & Water | Ama Qamata | Unknown Director | TV Mysteries | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | <1 |
| 4 | Blood & Water | Khosi Ngema | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | <1 |
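To see how `pd.cut` maps raw minutes to the labels used above, here is a minimal sketch on a few made-up durations (the sample values are hypothetical, not taken from the dataset). Intervals are right-closed by default, so a value of exactly 100 lands in `80-100`, and 0 lands in `<1`.

```python
import pandas as pd

# Hypothetical sample durations in minutes; 0 stands in for a TV show
sample = pd.Series([0, 45, 90, 100, 101, 312])

bins = [-1, 1, 50, 80, 100, 120, 150, 200, 315]
labels = ['<1', '1-50', '50-80', '80-100', '100-120', '120-150', '150-200', '200-315']

# Default right=True: every bin is (left, right], so 100 goes to '80-100'
binned = pd.cut(sample, bins=bins, labels=labels)
print(binned.tolist())
```

Passing `right=False` would instead make the bins left-closed, which shifts boundary values such as 100 into the next category.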
# Movies (no 'Season' in duration) take the binned label; TV shows keep their season count
is_movie = ~netflix_['duration'].str.contains('Season')
netflix_.loc[is_movie,'duration'] = netflix_.loc[is_movie,'duration2']
netflix_.drop(['duration2'],axis=1,inplace=True)
netflix_.head()
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Unknown Actor | Kirsten Johnson | Documentaries | United States | s1 | Movie | 2021-09-25 | 2020 | PG-13 | 80-100 |
| 1 | Blood & Water | Ama Qamata | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons |
| 2 | Blood & Water | Ama Qamata | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons |
| 3 | Blood & Water | Ama Qamata | Unknown Director | TV Mysteries | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons |
| 4 | Blood & Water | Khosi Ngema | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons |
Extracting the day, weekday, month and year from the date_added column makes it possible to check, for example, which month had the most TV shows added.
from datetime import datetime
from dateutil.parser import parse
# Derive calendar fields from date_added with the vectorised .dt accessor
netflix_["year_added"] = netflix_['date_added'].dt.year.astype("Int64")
netflix_["month_added"] = netflix_['date_added'].dt.month.astype("Int64")
# Use netflix_ (not netflix) so the source and target frames match
netflix_['month_name'] = netflix_['date_added'].dt.month_name()
netflix_["day_added"] = netflix_['date_added'].dt.day.astype("Int64")
netflix_['Weekday_added'] = netflix_['date_added'].dt.day_name()
netflix_.head()
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration | year_added | month_added | month_name | day_added | Weekday_added |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Unknown Actor | Kirsten Johnson | Documentaries | United States | s1 | Movie | 2021-09-25 | 2020 | PG-13 | 80-100 | 2021 | 9 | September | 25 | Saturday |
| 1 | Blood & Water | Ama Qamata | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 2 | Blood & Water | Ama Qamata | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 3 | Blood & Water | Ama Qamata | Unknown Director | TV Mysteries | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 4 | Blood & Water | Khosi Ngema | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
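As a small illustration of the date features, the sketch below runs the `.dt` accessor on a hypothetical three-row frame (the frame `demo` and its missing value are made up for this example). It shows that a missing date simply propagates as a missing value, and that the nullable `Int64` dtype keeps the year as an integer.

```python
import pandas as pd

# Hypothetical mini-frame; None mimics a title with no date_added
demo = pd.DataFrame({"date_added": pd.to_datetime(["2021-09-25", "2021-09-24", None])})

# Nullable Int64 keeps whole-number years even when a date is missing
demo["year_added"] = demo["date_added"].dt.year.astype("Int64")
demo["month_name"] = demo["date_added"].dt.month_name()
demo["Weekday_added"] = demo["date_added"].dt.day_name()
print(demo)
```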
# Strip parenthesised text from titles; regex=True is explicit to avoid the FutureWarning
netflix_['title'] = netflix_['title'].str.replace(r"\(.*\)", "", regex=True)
netflix_.head()
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration | year_added | month_added | month_name | day_added | Weekday_added |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Unknown Actor | Kirsten Johnson | Documentaries | United States | s1 | Movie | 2021-09-25 | 2020 | PG-13 | 80-100 | 2021 | 9 | September | 25 | Saturday |
| 1 | Blood & Water | Ama Qamata | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 2 | Blood & Water | Ama Qamata | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 3 | Blood & Water | Ama Qamata | Unknown Director | TV Mysteries | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 4 | Blood & Water | Khosi Ngema | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
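The title clean-up relies on a greedy pattern, which removes everything from the first `(` to the last `)`. The sketch below compares it with a non-greedy variant on hypothetical titles (none of these strings come from the dataset). The non-greedy pattern is an alternative, not what the notebook uses.

```python
import pandas as pd

# Hypothetical titles for illustration only
titles = pd.Series([
    "Sample Title (2019)",
    "Another (Part 1) Story (Extended)",
    "Plain Title",
])

# Greedy: one match spanning from the first "(" to the last ")"
greedy = titles.str.replace(r"\(.*\)", "", regex=True)

# Non-greedy, also eating the whitespace before each group
lazy = titles.str.replace(r"\s*\(.*?\)", "", regex=True)

print(greedy.tolist())
print(lazy.tolist())
```

The greedy version leaves a trailing space and drops the word between two parenthesised groups, so a follow-up `.str.strip()` may be worth considering.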
netflix_genre=netflix_.groupby(['Genre']).agg({"title":"nunique"}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize=(15,6))
sns.barplot(x = "Genre",y = 'title', data = netflix_genre)
plt.xticks(rotation = 60)
plt.title('Top 15 Genres')
plt.show()
International Movies, Dramas and Comedies are the genres with the most titles.
netflix_pie = netflix_.groupby(['type']).agg({'title':'nunique'}).reset_index()
netflix_pie
| | type | title |
|---|---|---|
| 0 | Movie | 6113 |
| 1 | TV Show | 2675 |
colors = sns.color_palette('bright')[0:5]
plt.figure(figsize=(10,4))
plt.pie(netflix_pie['title'], labels = netflix_pie['type'], colors = colors, autopct='%.0f%%')
plt.title('Percentage of movies and TV shows')
plt.show()
The data has roughly a 70:30 ratio of movies to TV shows (6113 vs. 2675 unique titles).
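The percentages in the pie chart can be checked directly from the counts in the `netflix_pie` output. This is a minimal sketch that rebuilds that small table by hand; the `share_pct` column name is introduced here for illustration.

```python
import pandas as pd

# Unique-title counts copied from the netflix_pie output
netflix_pie = pd.DataFrame({"type": ["Movie", "TV Show"], "title": [6113, 2675]})

# Share of each type as a percentage of all unique titles
netflix_pie["share_pct"] = (netflix_pie["title"] / netflix_pie["title"].sum() * 100).round(1)
print(netflix_pie)
```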
netflix_['Country'] = netflix_['Country'].str.replace(',', '')
netflix_.head()
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration | year_added | month_added | month_name | day_added | Weekday_added |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Unknown Actor | Kirsten Johnson | Documentaries | United States | s1 | Movie | 2021-09-25 | 2020 | PG-13 | 80-100 | 2021 | 9 | September | 25 | Saturday |
| 1 | Blood & Water | Ama Qamata | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 2 | Blood & Water | Ama Qamata | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 3 | Blood & Water | Ama Qamata | Unknown Director | TV Mysteries | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 4 | Blood & Water | Khosi Ngema | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
netflix_country = netflix_.groupby(['Country']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plt.figure(figsize=(15,6))
sns.barplot(y = "Country",x = 'title', data = netflix_country)
plt.xticks(rotation = 60)
plt.title('Top 10 Countries for content creation')
plt.show()
The US, India, the UK, Canada and France are the leading countries for content on Netflix.
netflix_rating = netflix_.groupby(['rating']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plt.figure(figsize=(15,6))
sns.barplot(y = "rating",x = 'title', data = netflix_rating)
plt.xticks(rotation = 60)
plt.title('Top 10 rating types')
plt.show()
The largest share of titles on Netflix carries a rating intended for mature audiences. Note that `rating` here is an age classification, not a quality score.
netflix_duration = netflix_.groupby(['duration']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plt.figure(figsize=(15,6))
sns.barplot(y = "duration",x = 'title', data = netflix_duration)
plt.xticks(rotation = 60)
plt.title('Top 10 duration categories')
plt.show()
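The most common duration category in the whole data is 80-100 minutes. These are movies; among TV shows, a single season is the most common. The data contains no viewing figures, so this reflects catalogue size, not what is watched most.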
# Drop the placeholder before taking the top 15 so the chart really shows 15 actors
netflix_actors = netflix_[netflix_['Actors']!='Unknown Actor'].groupby(['Actors']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize=(15,6))
sns.barplot(y = "Actors",x = 'title', data = netflix_actors )
plt.xticks(rotation = 60)
plt.title('Top 15 most popular Actors')
plt.show()
Anupam Kher, Shah Rukh Khan (SRK), Julie Tejwani, Naseeruddin Shah and Takahiro Sakurai occupy the top spots by number of titles.
# Drop the placeholder before taking the top 15 so the chart really shows 15 directors
netflix_directors = netflix_[netflix_['Directors']!='Unknown Director'].groupby(['Directors']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize=(15,6))
sns.barplot(y = "Directors",x = 'title', data = netflix_directors )
plt.xticks(rotation = 60)
plt.title('Top 15 most popular Directors')
plt.show()
Rajiv Chilaka, Jan Suter and Raul Campos are the directors with the most titles on Netflix.
netflix_.head()
| | title | Actors | Directors | Genre | Country | show_id | type | date_added | release_year | rating | duration | year_added | month_added | month_name | day_added | Weekday_added |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dick Johnson Is Dead | Unknown Actor | Kirsten Johnson | Documentaries | United States | s1 | Movie | 2021-09-25 | 2020 | PG-13 | 80-100 | 2021 | 9 | September | 25 | Saturday |
| 1 | Blood & Water | Ama Qamata | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 2 | Blood & Water | Ama Qamata | Unknown Director | TV Dramas | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 3 | Blood & Water | Ama Qamata | Unknown Director | TV Mysteries | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
| 4 | Blood & Water | Khosi Ngema | Unknown Director | International TV Shows | South Africa | s2 | TV Show | 2021-09-24 | 2021 | TV-MA | 2 Seasons | 2021 | 9 | September | 24 | Friday |
netflix_year = netflix_.groupby(['year_added']).agg({'title':'nunique'}).reset_index()
plt.figure(figsize=(15,6))
sns.lineplot(x = "year_added",y = 'title', data = netflix_year, color = 'red' )
plt.xticks(rotation = 60)
plt.title('movies/ TV shows across the years')
plt.show()
The amount of content added to Netflix increased continuously from 2008 to 2019 and then started decreasing (probably due to Covid).
fig = plt.figure(figsize = (15,5))
#plt.style.use('dark_background')
sns.countplot(data = netflix_,x = 'year_added',hue = 'type',palette ="Reds_r")
plt.title('Movies and TV Shows added to Netflix by year', fontsize=14)
Text(0.5, 1.0, 'Movies and TV Shows added to Netflix by year')
Over the years, additions of both TV shows and movies increased; after 2020 they started declining, possibly due to Covid. More movies than TV shows were added throughout.
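One caveat with `countplot` here: `netflix_` appears to hold one row per title/actor/director/genre/country combination (see the repeated "Blood & Water" rows in the `head()` output), so a row count is not a title count. The sketch below uses a hypothetical exploded frame to show the difference. The frame `exploded` and its values are made up.

```python
import pandas as pd

# Hypothetical exploded frame: title "A" repeats once per actor/genre combination
exploded = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C"],
    "type": ["Movie", "Movie", "Movie", "TV Show", "Movie"],
    "year_added": [2020, 2020, 2020, 2020, 2021],
})

# What a row-based count (such as countplot) sees
row_counts = exploded.groupby(["year_added", "type"]).size()

# Unique titles per year and type, matching the nunique approach used elsewhere
title_counts = exploded.groupby(["year_added", "type"])["title"].nunique()

print(row_counts)
print(title_counts)
```

Plotting `title_counts.reset_index()` with `sns.barplot(..., hue='type')` would then keep this chart consistent with the `nunique`-based charts.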
netflix_month = netflix_.groupby(['month_name', 'type']).agg({'title':'nunique'}).reset_index()
plt.figure(figsize=(15,6))
sns.lineplot(x = "month_name",y = 'title', data = netflix_month, color = 'red', hue = netflix_month.type )
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across the months')
plt.show()
For both TV shows and movies, the month with the most additions is the same: July, followed by December.
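Since `groupby` returns month names sorted alphabetically, the x-axis of a month plot may not follow calendar order. A minimal sketch of one way to fix this with an ordered categorical is shown below, on hypothetical counts (the `monthly` frame is made up for illustration).

```python
import pandas as pd

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December']

# Hypothetical monthly counts, deliberately in alphabetical order
monthly = pd.DataFrame({
    "month_name": ["April", "December", "January", "July"],
    "title": [5, 9, 4, 10],
})

# An ordered categorical makes sorting (and plotting) follow the calendar
monthly["month_name"] = pd.Categorical(monthly["month_name"], categories=month_order, ordered=True)
monthly = monthly.sort_values("month_name").reset_index(drop=True)
print(monthly)
```

The same idea should apply to weekday names, using a list that starts at Monday.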
netflix_month = netflix_.groupby(['month_name']).agg({'title':'nunique'}).reset_index()
plt.figure(figsize=(15,6))
sns.lineplot(x = "month_name",y = 'title', data = netflix_month, color = 'red' )
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across the months')
plt.show()
Overall, most content is added in December and July.
netflix_day = netflix_.groupby(['day_added']).agg({'title':'nunique'}).reset_index()
plt.figure(figsize=(15,6))
sns.barplot(x = "day_added",y = 'title', data = netflix_day, color = 'red' )
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across each day')
plt.show()
The 1st of the month is clearly when the most content is added.
netflix_weekday = netflix_.groupby(['Weekday_added', 'type']).agg({'title':'nunique'}).reset_index()
plt.figure(figsize=(15,6))
sns.lineplot(x = "Weekday_added",y = 'title', data = netflix_weekday, color = 'red' , hue = netflix_weekday.type)
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across weekdays')
plt.show()
netflix_weekday = netflix_.groupby(['Weekday_added']).agg({'title':'nunique'}).reset_index()
plt.figure(figsize=(15,6))
sns.lineplot(x = "Weekday_added",y = 'title', data = netflix_weekday, color = 'red' )
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across weekdays')
plt.show()
On Netflix, most content is added on Fridays, followed by Thursdays.
plt.figure(figsize=(15,6))
sns.boxplot(x='type', y='release_year', data=netflix_, )
sns.despine(left=True)
plt.title('Type of Show by Release Date')
plt.ylim(2000,2020)
(2000.0, 2020.0)
It seems TV shows have more recent release years, i.e. the TV shows on Netflix were mostly released in recent years.
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December']
content = netflix_.groupby('year_added')['month_name'].value_counts().unstack().fillna(0)[month_order].T
plt.figure(figsize=(10,8))
plt.title("Content added per month and year")
sns.heatmap(content , cmap = 'Blues')
plt.show()
The most movies and TV shows were added in November 2019 and July 2021.
Fewer movies and TV shows were added from 2008 to 2015
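The month-by-year table behind the heatmap can also be built with `pd.crosstab`, which some readers may find easier to follow than the `groupby`/`value_counts`/`unstack` chain. The sketch below compares both on a hypothetical four-row frame with a shortened `month_order`. Both count rows, not unique titles.

```python
import pandas as pd

# Hypothetical data: three rows added in 2019, one in 2021
demo = pd.DataFrame({
    "year_added": [2019, 2019, 2019, 2021],
    "month_name": ["November", "November", "July", "July"],
})
month_order = ["July", "November"]  # shortened for the example

# The chain used in the notebook: months as rows, years as columns
via_groupby = (demo.groupby("year_added")["month_name"]
               .value_counts().unstack().fillna(0)[month_order].T)

# Equivalent table via crosstab, reindexed to calendar order
via_crosstab = pd.crosstab(demo["month_name"], demo["year_added"]).reindex(month_order)

print(via_crosstab)
```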
plt.figure(figsize = (12,5))
sns.scatterplot(y = netflix_.index , x = netflix_.release_year , hue = netflix_.type , palette='Set2')
<AxesSubplot:xlabel='release_year'>
netflix_.groupby(['day_added']).agg({"title":"nunique"})
| day_added | title |
|---|---|
| 1 | 2219 |
| 2 | 325 |
| 3 | 151 |
| 4 | 175 |
| 5 | 231 |
| 6 | 210 |
| 7 | 190 |
| 8 | 201 |
| 9 | 148 |
| 10 | 214 |
| 11 | 149 |
| 12 | 181 |
| 13 | 175 |
| 14 | 198 |
| 15 | 688 |
| 16 | 289 |
| 17 | 180 |
| 18 | 205 |
| 19 | 243 |
| 20 | 248 |
| 21 | 190 |
| 22 | 230 |
| 23 | 182 |
| 24 | 159 |
| 25 | 196 |
| 26 | 205 |
| 27 | 195 |
| 28 | 190 |
| 29 | 140 |
| 30 | 210 |
| 31 | 274 |
The table confirms that the 1st of the month is when the most content is added, followed by the 15th.
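To put the day-of-month table in proportion, the sketch below recomputes each day's share of the per-day counts. The numbers are copied from the table; `day_counts` and `share` are names introduced here.

```python
import pandas as pd

# Unique-title counts for days 1..31, copied from the table
counts = [2219, 325, 151, 175, 231, 210, 190, 201, 148, 214, 149, 181, 175, 198,
          688, 289, 180, 205, 243, 248, 190, 230, 182, 159, 196, 205, 195, 190,
          140, 210, 274]
day_counts = pd.Series(counts, index=range(1, 32), name="title")

# Share of each day, largest first
share = (day_counts / day_counts.sum()).sort_values(ascending=False)
print(share.head(3).round(3))
```

On these numbers the 1st accounts for roughly a quarter of the per-day counts, with the 15th a distant second.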
netflix_shows = netflix_[netflix_['type']=='TV Show']
netflix_movies = netflix_[netflix_['type']=='Movie']
netflix_genre = netflix_shows.groupby(['Genre']).agg({"title":"nunique"}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize = (15,6))
sns.barplot(y = "Genre",x = 'title', data = netflix_genre)
plt.xticks(rotation = 60)
plt.title('Top 15 Genres')
plt.show()
International TV Shows, Dramas and Comedies are the most common genres among TV shows on Netflix.
netflix_genre = netflix_movies.groupby(['Genre']).agg({"title":"nunique"}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize = (15,6))
sns.barplot(y = "Genre",x = 'title', data = netflix_genre)
plt.xticks(rotation = 60)
plt.title('Top 15 Genres')
plt.show()
International Movies, Dramas and Comedies are the most common genres among movies on Netflix, followed by Documentaries.
netflix_country = netflix_shows.groupby(['Country']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plt.figure(figsize=(15,6))
sns.barplot(y = "Country",x = 'title', data = netflix_country)
plt.xticks(rotation = 60)
plt.title('Top 10 Countries for content creation')
plt.show()
netflix_country = netflix_movies.groupby(['Country']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plt.figure(figsize=(15,6))
sns.barplot(y = "Country",x = 'title', data = netflix_country)
plt.xticks(rotation = 60)
plt.title('Top 10 Countries for content creation')
plt.show()
The United States leads in both TV shows and movies, and the UK also contributes a lot of content in both. Surprisingly, India is much more prevalent in movies than in TV shows.
Moreover, the number of movies from India outweighs the combined number of TV shows and movies from the UK, which is why India ranks second in total content on Netflix.
Over the years, additions of both TV shows and movies increased until 2020; after 2020 they started declining, possibly due to Covid. More movies than TV shows were added throughout.
Most content is added in December and July; by weekday, Friday has the most additions, followed by Thursday.
The 1st of the month is clearly when the most content is added.
Anupam Kher, Shah Rukh Khan (SRK), Julie Tejwani, Naseeruddin Shah and Takahiro Sakurai occupy the top spots by number of titles.
Rajiv Chilaka, Jan Suter and Raul Campos are the directors with the most titles on Netflix.
Rajiv Chilaka is the director with the most movies.
Netflix focuses more on movies than on TV shows.
There is a roughly 70:30 ratio of movies to TV shows on the Netflix platform.
International Movies, Dramas and Comedies are the most common genres.
The US, India, the UK, Canada and France are the leading countries for content on Netflix.
The largest share of titles on Netflix carries a rating intended for mature audiences.
The most common duration in the whole data is 80-120 minutes. These are movies; among TV shows, a single season is the most common.
The United States leads in both TV shows and movies, and the UK also contributes a lot of content in both. Surprisingly, India is much more prevalent in movies than in TV shows.
Moreover, the number of movies from India outweighs the combined number of TV shows and movies from the UK, which is why India ranks second in total content on Netflix.
The most common genres across countries, for both TV shows and movies, are Drama, Comedy and International TV Shows/Movies, so producing more content in these genres is recommended.
Add TV shows/movies on July 1st or August 1st.
Add movies for the Indian audience; additions have been declining since 2018.
While creating content, take into consideration the popular actors and directors for that country. Also take popular director-actor combinations into account.
80-120 minutes is the recommended length for movies.